Dynamic Partition

There is another way of partitioning where we let the Hive engine dynamically determine the partitions based on the values of the partition column. Instead of loading each partition with single SQL statement as shown above, which will result in writing lot of SQL statements for huge no of partitions, Hive supports dynamic partitioning with which we can add any number of partitions with single SQL execution. Hive will automatically splits our data into separate partition files based on the values of partition keys present in the input files.It gives the advantages of easy coding and no need of manual identification of partitions. This dynamic partition suits well for our example requirement on user records provided above.

Before using this, we have to set a property that allows dynamic partition:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;

Steps Involved in Creating Dynamic Pratation
  • Create Temporary (Non-Partiton Table)
  • Load data into Temporary Table
  • Create partition table.
  • Enable Partitions.
  • Insert data to partition table from Temporary table.
  • Drop Temp table.
Create Normal Table 
CREATE TABLE  customers_part (
   customer_id  int,
   customer_fname  string,
   customer_lname  string,
   customer_email  string,
   customer_password  string,
   customer_street  string,
   customer_zipcode  string,
   customer_state string,
   customer_city string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'warehouse/customers_part'

Create Partition Table 
CREATE TABLE  customers_part (
   customer_id  int,
   customer_fname  string,
   customer_lname  string,
   customer_email  string,
   customer_password  string,
   customer_street  string,
   customer_zipcode  string
) PARTITIONED BY (customer_state string,customer_city string,) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'warehouse/customers_part'

insert into customers_part partition(customer_city,customer_state)
select
customer_id
,customer_fname 
,customer_lname 
,customer_email 
,customer_password 
,customer_street 
,customer_zipcode 
,customer_state
,customer_city
from customers

Note:-Please note that the partitioned column should be the last column in the select clause.


While loading the data into a table using dynamic partition if any null or empty value comes for a defined partition column, then it uses to create a default partition named __HIVE_DEFAULT_PARTITION__ at HDFS location and dump those records in that partition.

Suppose, I create a table that contains details of all the transactions done by the customers of year 2016:

CREATE TABLE transaction_details 
(
cust_id INT, 
amount FLOAT, 
month STRING, 
country STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ;

Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for each month. But, Hive is taking too much time in processing this query. How will you solve this problem and list the steps that I will be taking in order to do so?

We can solve this problem of query latency by partitioning the table according to each month. So, for each month we will be scanning only the partitioned data instead of whole data sets. As we know, we can’t partition an existing non-partitioned table directly. So, we will be taking following steps to solve the very problem:

1. Create a partitioned table, say partitioned_transaction:  
CREATE TABLE partitioned_transaction 
(cust_id INT, 
amount FLOAT, 
country STRING
) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ;
2. Enable dynamic partitioning in Hive:
3. Transfer the data from the non – partitioned table into the newly created partitioned table:

INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) 
SELECT 
cust_id, 
amount, 
country, 
month 
FROM transaction_details;
Now, we can perform the query using each partition and therefore, decrease the query time.

How can you add a new partition for the month December in the partitioned table?

ALTER TABLE partitioned_transaction ADD PARTITION (month=’Dec’) LOCATION  ‘/partitioned_transaction’;

create table sample_table
(
    id int
)
partitioned by (day date)
LOCATION  '/user/hive/warehouse/dwdev.db/sample_table';

hadoop fs -mkdir hdfs://localhost:9000/user/hive/warehouse/dwdev.db/sample_table/day=2017-03-02;
hadoop fs -mkdir hdfs://localhost:9000/user/hive/warehouse/dwdev.db/sample_table/day=2017-03-04;
msck repair table sample_table;

No comments:

Post a Comment